Contact:
Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060
Last update: 10 Oct 2011
Task 7. Synthetic data files
Publication of synthetic (i.e., simulated) data is an alternative to masking original data
when protecting data against disclosure. The idea is to randomly generate data under the constraint
that certain statistics or internal relationships of the original dataset are preserved.
Rubin (1993) proposed a new approach: generating fully synthetic data sets to guarantee
confidentiality. His idea was to treat all the observations from the sampling frame that are
not part of the sample as missing data and to impute them according to the multiple imputation framework.
Afterwards, several simple random samples from these fully imputed data sets are released to the public.
Because all imputed values are random draws from the posterior predictive distribution of the missing
values given the observed values, disclosure of sensitive information is nearly impossible, especially
if the released data sets do not contain any real data. Another advantage of this approach concerns
the sampling design of the released data sets: since they can be simple random samples from the
population, the analyst does not have to account for a complex sampling design in the analysis.
With this approach the sampling weights can be removed completely from the released data;
this is a further advantage, as these weights often carry disclosive information, especially for surveys
on enterprises (where the inclusion probability is usually proportional to size).
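As a rough illustration of Rubin's idea, the sketch below (Python, on hypothetical toy data; a simple normal linear model with flat priors stands in for the full multiple-imputation machinery of Raghunathan et al. (2003)) treats every non-sampled frame unit as missing, imputes it from the posterior predictive distribution, and releases simple random samples of imputed records only:

import numpy as np

rng = np.random.default_rng(42)

# Toy sampling frame: x is known for all N frame units, y is observed
# only for the n sampled units.
N, n, m = 10_000, 500, 5      # frame size, sample size, number of imputations
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)
sampled = rng.choice(N, size=n, replace=False)
not_sampled = np.setdiff1d(np.arange(N), sampled)

def posterior_predictive_draw(x_obs, y_obs, x_new, rng):
    # One draw of y_new from the posterior predictive distribution of a
    # normal linear model with flat priors -- a deliberately simplified
    # stand-in for the full framework of Raghunathan et al. (2003).
    X = np.column_stack([np.ones_like(x_obs), x_obs])
    beta_hat, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
    resid = y_obs - X @ beta_hat
    dof = len(y_obs) - X.shape[1]
    sigma2 = resid @ resid / rng.chisquare(dof)          # draw sigma^2 | data
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X.T @ X))
    X_new = np.column_stack([np.ones_like(x_new), x_new])
    return X_new @ beta + rng.normal(scale=np.sqrt(sigma2), size=len(x_new))

releases = []
for _ in range(m):
    # Treat every non-sampled unit as missing and impute its y value.
    y_imp = posterior_predictive_draw(x[sampled], y[sampled], x[not_sampled], rng)
    # Release a simple random sample drawn from the imputed (non-sampled)
    # part of the frame, so the released file contains no real y values.
    pick = rng.choice(len(not_sampled), size=n, replace=False)
    releases.append(np.column_stack([x[not_sampled][pick], y_imp[pick]]))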
Rubin's proposal was developed more fully in Raghunathan, Reiter, and Rubin (2003),
and a simulation study of it was given in Reiter (2002). Inference on synthetic data is
discussed in Reiter (2005a), and an application is given in Reiter (2005b).
A further application of the multiple imputation philosophy is found in An and Little (2007),
where selected records showing high levels of a target variable are imputed, thus providing an
alternative to top-coding.
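As a minimal sketch of this alternative to top-coding, the following Python fragment (toy data; the Pareto tail model and the 99% cutoff are our illustrative assumptions, not An and Little's specification) replaces at-risk values with draws from a fitted upper-tail model instead of collapsing them to a single top-code:

import numpy as np

rng = np.random.default_rng(1)

# Toy income variable; the top 1% of values is considered at risk.
income = rng.lognormal(mean=10.0, sigma=1.0, size=2_000)
cutoff = np.quantile(income, 0.99)
at_risk = income > cutoff

# Instead of top-coding (collapsing everything above the cutoff to a single
# value), replace the at-risk values with draws from a model of the upper
# tail; a Pareto tail fitted by maximum likelihood is used for illustration.
tail = income[at_risk]
alpha = len(tail) / np.log(tail / cutoff).sum()        # Pareto index MLE
income_syn = income.copy()
u = 1.0 - rng.random(at_risk.sum())                    # uniform draws in (0, 1]
income_syn[at_risk] = cutoff * u ** (-1.0 / alpha)     # Pareto inverse CDF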
However, the quality of these imputation-based methods depends strongly on the accuracy of the model
used to impute the "missing" values. If the model does not include all the relationships between the
variables that are of interest to the analyst, or if the joint distribution of the variables is
mis-specified, results from the synthetic data sets can be biased. Furthermore, specifying a model
that accounts for all the skip patterns and constraints between the variables can be cumbersome,
if not impossible.
To overcome these problems, a related approach, suggested by Little (1993), replaces observed values
with imputed values only for variables that bear a high risk of disclosure or that contain
especially sensitive information, leaving the rest of the data unchanged. This approach, discussed
in the literature as generating partially synthetic data sets (see also Reiter, 2003), has been adopted
for some data sets in the US (see for example Abowd and Woodcock, 2001, 2004, Abowd, Stinson and
Benedetto, 2006, or Kennickell, 1997).
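The following sketch (Python, hypothetical toy data) illustrates partial synthesis in its simplest form: only the sensitive variable is replaced by model-based draws conditional on the variables left unchanged. A plain normal linear model is an illustrative assumption; a full application would also draw the model parameters from their posterior:

import numpy as np

rng = np.random.default_rng(7)

# Toy microdata: x1 and x2 are released unchanged; wage is the high-risk variable.
n = 1_000
x1 = rng.normal(size=n)
x2 = rng.integers(0, 5, size=n).astype(float)
wage = 30.0 + 4.0 * x1 + 2.0 * x2 + rng.normal(scale=3.0, size=n)

# Fit a model of the sensitive variable given the variables kept as-is,
# then replace every observed wage with a synthetic draw.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
sigma = np.std(wage - X @ beta, ddof=X.shape[1])

m = 5   # release m partially synthetic copies of the file
releases = [np.column_stack([x1, x2, X @ beta + rng.normal(scale=sigma, size=n)])
            for _ in range(m)]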
To address the problem of model misspecification, another area of active research has been the
formulation of nonparametric statistical models. Reiter (2005c) proposes the use of CART
(classification and regression trees) to estimate nonparametrically the distribution that generates
the synthetic data. Franconi and Polettini (2007) have investigated the use of Bayesian networks to
generate synthetic data while allowing for logical constraints among categorical variables. Methods
for generating partially synthetic data that maintain specific statistics on certain sub-domains have
also been proposed by Polettini (2003), within a semiparametric framework. Both of the latter methods
have been applied to enterprise microdata.
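A rough sketch of the CART approach (Python with scikit-learn, toy data; Reiter (2005c) additionally draws from a Bayesian bootstrap within each leaf, which is omitted here) grows a regression tree of the sensitive variable on the predictors and resamples each record's value from its own leaf:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Toy data: y is the variable to synthesize, X holds the predictors.
n = 2_000
X = rng.normal(size=(n, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=n)

# Grow a regression tree of y on X; each leaf approximates the conditional
# distribution of y nonparametrically by the original values it contains.
tree = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

# Replace each record's y with a value resampled from its own leaf, so no
# parametric form is imposed on the conditional distribution.
leaves = tree.apply(X)
y_syn = np.empty(n)
for leaf in np.unique(leaves):
    members = np.where(leaves == leaf)[0]
    y_syn[members] = rng.choice(y[members], size=len(members), replace=True)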
Drechsler et al. (2006) and Drechsler, Bender and Rässler (2007) describe an application to German
data, generating both fully and partially synthetic data sets for a panel of establishments.
Their results for a cross-section are very promising.
A criticism often made of synthetic data is that they only preserve the relationships considered
in the model used to create them. That is, for data uses not anticipated by the data protector,
such as subdomain analysis, completely wrong results can be obtained. This is not the case for SDC
methods based on masking. Recently, an interesting attempt to combine the advantages of synthetic data
(specifically the Burridge (2003) method) and of masking has been made by Muralidhar and Sarathy (2007).
In the last few years, the use of synthetic data has attracted much more attention as an alternative
approach to microdata protection. The development and application of synthetic data have so far been
concentrated mainly in the USA; it should be investigated whether this approach could be useful for
Europe as well.
For all of the previously mentioned applications, further studies are needed to assess the large-scale
implementability of these methods, to study practical issues related to such implementation (notably
computational burden and time), and to check the data utility of the resulting synthetic data,
especially as regards the goodness of fit of the model and its ability to reproduce the observed
relationships over subdomains to a large extent. These aspects also have to be evaluated carefully
with a view to allowing access to synthetic data, a setting where the analyst's model is unknown
and may involve specific subpopulations.
The methods cited in the paragraphs above will be analyzed, case studies (e.g. on a sample of enterprises
stemming from an Italian survey) will be developed, and an overview report with some recommendations
will be produced. If useful, we will try to incorporate methods for generating synthetic data into
µ-ARGUS in the second year.
Partners: URV, AT and DE for the report; IT also for an example of synthetic data on a
specific survey; NL for the implementation in µ-ARGUS; DE and IT for testing the
feasibility of these methods.
Deliverables: A report after year 1 and a possible implementation at the end of year 2.
References:
J.M. Abowd, M. Stinson and G. Benedetto (2006). Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project, mimeo, Washington.
J.M. Abowd and S.D. Woodcock (2001). Disclosure limitation in longitudinal linked data. In: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, North-Holland, Amsterdam, 215-277.
J.M. Abowd and S.D. Woodcock (2004). Multiply-imputing confidential characteristics and file links in longitudinal linked data. In: Privacy in Statistical Databases, Springer-Verlag, New York, 290-297.
D. An and R.J.A. Little (2007). Multiple imputation: an alternative to top coding for statistical disclosure control. Journal of the Royal Statistical Society, Series A, 170(4), 923-940.
J. Burridge (2003). Information preserving statistical obfuscation. Statistics and Computing, 13, 321-327.
J. Drechsler, A. Dundler, S. Bender, S. Rässler and T. Zwick (2006). A New Approach for Disclosure Control in the IAB Establishment Panel - Multiple Imputation for a Better Data Access. UNECE Work Session on Statistical Data Editing (Bonn, Germany, 25-27 September 2006), invited paper.
J. Drechsler, S. Bender and S. Rässler (2007). Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control in the German IAB Establishment Panel, mimeo, Nuremberg.
L. Franconi and S. Polettini (2007). Some experiences at Istat on data simulation. Proceedings of the ISI Conference, Lisbon, 23-29 August 2007.
A.B. Kennickell (1997). Multiple imputation and disclosure protection: the case of the 1995 Survey of Consumer Finances. In: Record Linkage Techniques, National Academy Press, Washington D.C., 248-267.
R.J.A. Little (1993). Statistical Analysis of Masked Data. Journal of Official Statistics, 9, 407-426.
K. Muralidhar and R. Sarathy (2007). Generating sufficiency-based non-synthetic perturbed data. IEEE Transactions on Knowledge and Data Engineering (to appear).
S. Polettini (2003). Maximum entropy simulation for microdata protection. Statistics and Computing, 13(4), 307-320.
T.E. Raghunathan, J.P. Reiter and D.B. Rubin (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), 1-16.
J.P. Reiter (2002). Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18(4), 531-544.
J.P. Reiter (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology, 29, 181-188.
J.P. Reiter (2005a). Releasing multiply-imputed, synthetic public use microdata: an illustration and empirical study. Journal of the Royal Statistical Society, Series A, 168, 185-205.
J.P. Reiter (2005b). Significance tests for multi-component estimands from multiply-imputed, synthetic microdata. Journal of Statistical Planning and Inference, 131(2), 365-377.
J.P. Reiter (2005c). Using CART to generate partially synthetic public use microdata. Journal of Official Statistics, 21, 441-462.
D.B. Rubin (1993). Discussion: statistical disclosure limitation. Journal of Official Statistics, 9(2), 461-468.